Set-based Similarity Measurement and Ranking Model to Identify Cases of Journalistic Text Reuse

Authors

  • Arpan Pal
  • Lee Gillam
Abstract

In this paper, we describe our approach to linking news articles in a cross-lingual environment, English and Hindi, as submitted for the CrossLingual Indian News Story Search (CL!NSS)[1] task at FIRE'13. In our approach, English documents are first translated into Hindi using Google Translate[2] and then compared to the potential Hindi sources on five features of the documents: title, article content, unique words in the content, frequent words in the content, and publication date. A weighted combination of the five individual similarity scores provides an overall similarity value. Results are promising, with best Normalized Discounted Cumulative Gain (NDCG) values at ranks 1, 5, and 10 (NDCG@1, NDCG@5, NDCG@10) of 0.6600, 0.5579, and 0.5604, respectively. These place the system third by organization and fifth by run.
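
The weighted combination and the NDCG evaluation described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the weights in `WEIGHTS`, the use of Jaccard overlap as the set-based measure, the multiset variant for article content, the top-10 cut-off for frequent words, the 7-day date window, and all function names. Documents are assumed to be already translated into Hindi and tokenized, and are represented as dictionaries with `title`, `content` (token lists), and `date` keys.

```python
import math
from collections import Counter
from datetime import date

# Hypothetical weights; the abstract does not state the values actually used.
WEIGHTS = {"title": 0.2, "content": 0.3, "unique": 0.2, "frequent": 0.2, "date": 0.1}


def jaccard(a, b):
    """Jaccard overlap of two token collections (assumed set-based measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def multiset_jaccard(a, b):
    """Jaccard overlap that keeps token counts (assumed measure for content)."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 0.0


def frequent_words(tokens, n=10):
    """The n most common tokens of a document (n is an assumed cut-off)."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def date_similarity(d1, d2, window_days=7):
    """Linear decay from 1.0 (same day) to 0.0 (window_days or more apart)."""
    return max(0.0, 1.0 - abs((d1 - d2).days) / window_days)


def overall_similarity(doc, cand, weights=WEIGHTS):
    """Weighted sum of the five per-feature similarity scores."""
    scores = {
        "title": jaccard(doc["title"], cand["title"]),
        "content": multiset_jaccard(doc["content"], cand["content"]),
        "unique": jaccard(doc["content"], cand["content"]),
        "frequent": jaccard(frequent_words(doc["content"]),
                            frequent_words(cand["content"])),
        "date": date_similarity(doc["date"], cand["date"]),
    }
    return sum(weights[name] * scores[name] for name in weights)


def rank_candidates(doc, candidates, weights=WEIGHTS):
    """Return (candidate index, score) pairs, highest score first."""
    scored = [(i, overall_similarity(doc, c, weights))
              for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def ndcg_at_k(relevances, k):
    """NDCG@k with linear gain and log2 discount (one common formulation).

    `relevances` holds the graded relevance of the returned documents in
    ranked order; the ideal ordering is taken from the same list.
    """
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```

In this sketch, `rank_candidates` would be called once per translated English article against the pool of Hindi candidates, and `ndcg_at_k` would then be applied to the relevance judgments of the resulting ranking with `k` set to 1, 5, or 10. The exact NDCG variant used by the task organizers may differ from the one shown.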

Similar resources

Building and annotating a corpus for the study of journalistic text reuse

In this paper we present the METER Corpus, a novel resource for the study and analysis of journalistic text reuse. The corpus consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers. In some cases the newspaper stories are rewritten from the PA source; in other ...

Frame Labeling of Competing Narratives in Journalistic Translation

Studying translations during the time of conflict has gained currency in the recent decade in translation studies. One of the cases in which conflict manifests itself is in the way different countries choose to name an event or a geographical location, for example. This study set out to understand how translation of rival names and labeling was carried out in Iranian state-run news agencies. To...

A NOVEL FUZZY-BASED SIMILARITY MEASURE FOR COLLABORATIVE FILTERING TO ALLEVIATE THE SPARSITY PROBLEM

Memory-based collaborative filtering is the most popular approach to build recommender systems. Despite its success in many applications, it still suffers from several major limitations, including data sparsity. Sparse data affect the quality of the user similarity measurement and consequently the quality of the recommender system. In this paper, we propose a novel user similarity measure based...

Text Reuse Detection using a Composition of Text Similarity Measures

Detecting text reuse is a fundamental requirement for a variety of tasks and applications, ranging from journalistic text reuse to plagiarism detection. Text reuse is traditionally detected by computing similarity between a source text and a possibly reused text. However, existing text similarity measures exhibit a major limitation: They compute similarity only on features which can be derived ...

Postgraduate Transfer Report.PDF

This thesis builds upon our current understanding of text reuse by proposing a hypothetical framework of text reuse and applying this abstract definition to a specific domain, that of journalistic reuse. The framework aims to explore a suitable measure of reuse and determine suitable discriminators for document derivation. Although text can be reused verbatim (word-for-word), in most cases, tex...

Publication date: 2013